feat(server): reduce layer-split activation memory with backend precision policy #310

Open
weicj wants to merge 4 commits into Luce-Org:main from weicj:feat-backend-activation-precision-policy-after-306

@weicj (Collaborator) commented May 29, 2026

Summary

This PR reduces long-context target layer-split OOM risk by storing layer-split activation staging buffers in a backend-appropriate dtype instead of always using F32. The graph still casts back to F32 at ggml RMSNorm / norm-weight boundaries, so the lower-precision storage does not change those operator contracts.
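To make the storage-versus-compute split concrete, here is a minimal sketch of a BF16 round trip for a single activation value. The helper names `f32_to_bf16_bits` and `bf16_bits_to_f32` are hypothetical and written for illustration; they are not taken from the PR or from ggml. The sketch assumes round-to-nearest-even on the narrowing side, which is a common choice but may not match what the backend actually does.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical helpers for illustration only; not the PR's or ggml's code.

// Narrow an F32 value to 16-bit BF16 storage (round-to-nearest-even).
inline std::uint16_t f32_to_bf16_bits(float value) {
    std::uint32_t bits;
    std::memcpy(&bits, &value, sizeof(bits));
    if ((bits & 0x7fffffffu) > 0x7f800000u) {
        // NaN: truncate and force a mantissa bit so it stays a NaN.
        return static_cast<std::uint16_t>((bits >> 16) | 0x0040u);
    }
    // Add 0x7fff plus the lowest kept bit so exact ties round to even.
    bits += 0x7fffu + ((bits >> 16) & 1u);
    return static_cast<std::uint16_t>(bits >> 16);
}

// Widen BF16 storage back to F32. This direction is exact, because BF16
// is the upper 16 bits of the F32 encoding.
inline float bf16_bits_to_f32(std::uint16_t stored) {
    const std::uint32_t bits = static_cast<std::uint32_t>(stored) << 16;
    float value;
    std::memcpy(&value, &bits, sizeof(value));
    return value;
}
```

The narrowing step loses mantissa bits once, when the value is stored. The widening step is lossless, so casting a stored value back to F32 before a norm reproduces that stored value exactly and adds no further error.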

The allocation reduced here is sized, in bytes, as hidden_size * context_tokens * bytes_per_element * staging_buffer_count. Qwen35 and Laguna currently use two staging buffers; Gemma4 uses three because its per-layer input path also keeps an original embedding staging buffer.

At a 256K context cap, the staging allocation changes from:

  • Qwen35 / Qwen3.6-27B / hidden=5120: 10240 MiB -> 5120 MiB.
  • Laguna-XS.2 / hidden=2048: 4096 MiB -> 2048 MiB.
  • Gemma4 31B / hidden=5376: 16128 MiB -> 8064 MiB.
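The figures above can be re-derived from the sizing formula. The sketch below assumes that "256K" means 262144 tokens, that F32 takes 4 bytes per element, and that F16 and BF16 take 2. `staging_bytes`, `to_mib`, and `kContext256K` are hypothetical names introduced for this check, not identifiers from the PR.

```cpp
#include <cstdint>

// Hypothetical helper: bytes needed for the layer-split staging buffers.
constexpr std::uint64_t staging_bytes(std::uint64_t hidden_size,
                                      std::uint64_t context_tokens,
                                      std::uint64_t bytes_per_element,
                                      std::uint64_t staging_buffer_count) {
    return hidden_size * context_tokens * bytes_per_element * staging_buffer_count;
}

constexpr std::uint64_t to_mib(std::uint64_t bytes) {
    return bytes / (1024ull * 1024ull);
}

// Assumption: a 256K context cap is 256 * 1024 tokens.
constexpr std::uint64_t kContext256K = 256ull * 1024ull;

// Qwen35, hidden=5120, two staging buffers.
static_assert(to_mib(staging_bytes(5120, kContext256K, 4, 2)) == 10240, "Qwen35 F32");
static_assert(to_mib(staging_bytes(5120, kContext256K, 2, 2)) == 5120, "Qwen35 16-bit");
// Laguna-XS.2, hidden=2048, two staging buffers.
static_assert(to_mib(staging_bytes(2048, kContext256K, 4, 2)) == 4096, "Laguna F32");
static_assert(to_mib(staging_bytes(2048, kContext256K, 2, 2)) == 2048, "Laguna 16-bit");
// Gemma4 31B, hidden=5376, three staging buffers.
static_assert(to_mib(staging_bytes(5376, kContext256K, 4, 3)) == 16128, "Gemma4 F32");
static_assert(to_mib(staging_bytes(5376, kContext256K, 2, 3)) == 8064, "Gemma4 16-bit");
```

Under those assumptions each model's staging allocation halves when moving from F32 to a 16-bit dtype, matching the listed numbers.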

Changes

  • Add BackendActivationPolicy in common/backend_precision for target layer-split activation staging.
  • Select activation dtype from backend architecture:
    • CUDA Ampere+ and HIP native-BF16 targets -> BF16.
    • CUDA tensor-core / GP100 and HIP gfx9/gfx10 targets without native BF16 -> F16.
    • Older or unknown CUDA/HIP targets -> F32.
  • Add LUCEBOX_LAYER_SPLIT_ACT_TYPE=f32|f16|bf16 as an explicit local override.
  • Extend shared layer-split activation buffers to allocate F32, F16, or BF16 tensors and upload F32 embeddings into the selected storage dtype.
  • Add common/ggml_graph_precision.h for graph-side F32 casts.
  • Make Qwen35, Gemma4, and Laguna RMSNorm / norm-weight graph boundaries F32-safe.
  • Wire Qwen35, Gemma4, and Laguna target layer-split adapters through the shared activation policy.
  • Add no-GPU unit coverage for the CUDA SM and HIP gfx dtype-selection tables.
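As a rough illustration of the dtype selection and override described above, the following sketch covers the CUDA side only. Everything in it is an assumption, not the PR's actual BackendActivationPolicy code: the names `ActDType`, `select_cuda_act_dtype`, `parse_act_dtype`, and `resolve_act_dtype` are hypothetical; the thresholds assume SM 8.0+ means Ampere or newer, SM 7.x means tensor-core parts, and SM 6.0 means GP100; and falling back to the automatic choice on an unrecognized override is a guess. HIP is omitted because the description does not list which gfx targets count as native BF16.

```cpp
#include <string>

// Hypothetical types and functions for illustration only.
enum class ActDType { F32, F16, BF16 };

// Assumed mapping from CUDA compute capability to staging dtype.
inline ActDType select_cuda_act_dtype(int sm_major, int sm_minor) {
    if (sm_major >= 8) {
        return ActDType::BF16;  // assumed: Ampere or newer
    }
    if (sm_major == 7 || (sm_major == 6 && sm_minor == 0)) {
        return ActDType::F16;   // assumed: tensor-core parts, or GP100
    }
    return ActDType::F32;       // older or unknown targets
}

// Parse an override string such as "f32", "f16", or "bf16".
// Returns false when the text is null or not one of those values.
inline bool parse_act_dtype(const char* text, ActDType& out) {
    if (text == nullptr) {
        return false;
    }
    const std::string value(text);
    if (value == "f32")  { out = ActDType::F32;  return true; }
    if (value == "f16")  { out = ActDType::F16;  return true; }
    if (value == "bf16") { out = ActDType::BF16; return true; }
    return false;
}

// An explicit override wins; otherwise use the architecture-based choice.
inline ActDType resolve_act_dtype(int sm_major, int sm_minor,
                                  const char* override_text) {
    ActDType chosen = ActDType::F32;
    if (parse_act_dtype(override_text, chosen)) {
        return chosen;
    }
    return select_cuda_act_dtype(sm_major, sm_minor);
}
```

A caller could pass `std::getenv("LUCEBOX_LAYER_SPLIT_ACT_TYPE")` (from `<cstdlib>`) as `override_text`. Because the selection functions take plain integers and strings, the table can be unit-tested without a GPU, in the spirit of the no-GPU coverage listed above.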

Notes

easel pushed a commit to easel/lucebox-hub that referenced this pull request May 29, 2026
Record PR Luce-Org#310 integration, refreshed PR classification, retained probe worktrees, and validation outcomes for the unattended integration run.
@cubic-dev-ai (Bot, Contributor) left a comment

4 issues found across 30 files

Comment thread server/src/qwen35/qwen35_layer_split_adapter.cpp Outdated
Comment thread server/src/gemma4/gemma4_layer_split_adapter.cpp Outdated
Comment thread server/src/laguna/laguna_target_loader.cpp
Comment thread server/src/laguna/laguna_layer_split_adapter.cpp Outdated
@weicj weicj force-pushed the feat-backend-activation-precision-policy-after-306 branch from e73c2e3 to bf9f4b5 Compare May 29, 2026 17:47
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 29, 2026
Apply the updated Luce-Org#310 backend activation precision policy changes on top of the existing auto-integration conflict resolution, preserving the Qwen35 MoE persistent logits graph resolution already carried in the stack.
easel pushed a commit to easel/lucebox-hub that referenced this pull request May 29, 2026
Record the 2026-05-29 14:03 cron pass, confirming Luce-Org#309/Luce-Org#310 and all other current included PR heads remain ancestors of easel/auto-integration. Re-probe the remaining old non-draft PRs from the current integration tip and record their conflict sets and retained worktrees.